Named Entity Recognition using Support Vector Machine: A Language Independent Approach
Abstract
Named Entity Recognition (NER) aims to classify each word of a document into predefined target named entity classes and is now considered fundamental to many Natural Language Processing (NLP) tasks, such as information retrieval, machine translation, information extraction, and question answering. This paper reports on the development of a NER system for Bengali and Hindi using a Support Vector Machine (SVM). Though this state-of-the-art machine learning technique has been widely applied to NER in several well-studied languages, its application to Indian languages (ILs) is very new. The system uses the contextual information of the words, along with a variety of features that help predict the four named entity (NE) classes: Person name, Location name, Organization name, and Miscellaneous name. We have used annotated corpora of 122,467 tokens of Bengali and 502,974 tokens of Hindi, tagged with the twelve NE classes defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL). In addition, we have manually annotated 150K wordforms of a Bengali news corpus developed from the web archive of a leading Bengali newspaper. We have also developed an unsupervised algorithm to generate lexical context patterns from a part of the unlabeled Bengali news corpus. These lexical patterns have been used as SVM features to improve system performance. The NER system has been tested on gold standard test sets of 35K tokens for Bengali and 60K tokens for Hindi. Evaluation results show recall, precision, and f-score values of 88.61%, 80.12%, and 84.15%, respectively, for Bengali, and 80.23%, 74.34%, and 77.17%, respectively, for Hindi. The results show an improvement of 5.13% in the f-score with the use of context patterns. A statistical analysis (ANOVA) is also performed to compare the performance of the proposed NER system with that of the existing HMM-based system for both languages.
Keywords: Named Entity (NE); Named Entity Recognition (NER); Support Vector Machine (SVM); Bengali; Hindi.
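The reported f-scores can be cross-checked against the reported recall and precision. The sketch below assumes the f-score is the balanced harmonic mean of precision and recall (beta = 1); the abstract does not state the weighting, and the function name `f_score` is illustrative rather than taken from the paper.

```python
def f_score(recall: float, precision: float, beta: float = 1.0) -> float:
    """F-measure of precision and recall.

    Assumption: beta = 1 (balanced harmonic mean); the abstract does not
    say which weighting the authors used.
    """
    if recall == 0.0 and precision == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Recall and precision (in percent) as reported in the abstract.
reported = {"Bengali": (88.61, 80.12), "Hindi": (80.23, 74.34)}

for language, (recall, precision) in reported.items():
    print(f"{language}: f-score = {f_score(recall, precision):.2f}%")
```

Under this assumption the computed values round to 84.15% for Bengali and 77.17% for Hindi, which agrees with the figures in the abstract.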
Similar resources
A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features
Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three approaches have conventionally been used to extract named entities from text: rule-based, machine-learning-based, and hybrids of the two. Machine-learning-based methods perform well on the Persian language if they are trained with good features. To get good performanc...
Language and Domain-Independent Named Entity Recognition: Experiment using SVM and High-Dimensional Features
This paper presents the results of experiments aiming to explore a baseline solution for the named entity recognition problem using a language-independent, domain-independent approach. The first domain chosen for this experiment is the biomedical publications domain, especially selected due to its importance and inherent challenges. A supervised learning approach using Support Vector Machines (S...
A discrete Kernel Approach to Support Vector Machine Learning in Language Independent Named Entity Recognition
In this paper we discuss a discrete kernel approach to Support Vector Machine (SVM) learning for Language Independent Named Entity Recognition (LINER). The kernel we use is called the polynomial overlap kernel (POK). It is derived from a distance function that has been successfully used in memory-based learning for various natural language problems. The POK is a discrete function and it works ...
Using Language Independent and Language Specific Features to Enhance Arabic Named Entity Recognition
The Named entity recognition task has been garnering significant attention as it has been shown to help improve the performance of many natural language processing applications. More recently, we are starting to see a surge in developing named entity recognition systems for languages other than English. With the relative abundance of resources for the Arabic language and a certain degree of mat...
Addressing Scalability Issues of Named Entity Recognition Using Multi-Class Support Vector Machines
This paper explores the scalability issues associated with solving the Named Entity Recognition (NER) problem using Support Vector Machines (SVM) and high-dimensional features. The performance results of a set of experiments conducted using binary and multi-class SVM with increasing training data sizes are examined. The NER domain chosen for these experiments is the biomedical publications doma...
Arabic Named Entity Recognition: an SVM-based Approach
The Named Entity Recognition (NER) task has been garnering significant attention as it has been shown to help improve the performance of many Natural Language Processing (NLP) applications. More recently, we are starting to see a surge in developing NER systems for languages other than English. With the relative abundance of resources for the Arabic language and a certain degree of maturation i...
Publication date: 2012